Li S, Xiao T, Li H, et al. Person search with natural language description[C]//Proc. CVPR. 2017.
1. Overview
1.1. Motivation
- existing methods mainly focus on searching for persons with image-based or attribute-based queries, which limits practical usage
- no person dataset or benchmark with textual descriptions was available
This paper studies person search with natural language description:
- proposed a Recurrent Neural Network with Gated Neural Attention mechanism (GNA-RNN)
- collected the CUHK Person Description Dataset (CUHK-PEDES)
1.2. Image-Based Query
- person re-identification, which requires at least one photo of the queried person to be given
1.3. Attribute-Based Query
- pre-defined semantic attributes have limited capability in describing persons' appearance, and labeling an exhaustive set of attributes is expensive
1.4. Contribution
- person search with language is more practical for real-world use
- investigated different solutions: image captioning, VQA, and visual-semantic embedding
- proposed GNA-RNN
1.5. Related Work
1.5.1. Language Dataset for Vision
- Flickr8K, Flickr30K
- MS-COCO Caption
- Visual Genome
- Caltech-UCSD Birds
- Oxford-102 flowers
1.5.2. Deep Language Models for Vision
- Image Caption. NeuralTalk
- VQA. Stacked Attention Network
- Visual-Semantic Embedding
1.6. CUHK-PEDES Dataset
- 40,206 images of 13,003 persons collected from five existing person re-identification datasets (CUHK03, Market-1501, SSM, VIPeR, CUHK01)
- 80,412 sentences for the 40,206 images (2 sentences/image), with details about appearance, actions, poses, and interactions with other objects
- analysis of high-frequency words
1.6.1. User Study
- Language vs. Attribute. Language descriptions are much more precise and effective than attributes in describing persons (top-1: 58.7% vs. 33.3%; top-5: 92.0% vs. 74.7%).
- Sentence Number and Length. Three sentences achieve the highest retrieval accuracy; the longer the sentences, the easier it is for users to retrieve the correct images.
- Word Types. Nouns provide the most information, followed by adjectives, while verbs carry the least information.
2. GNA-RNN
- the key is to build word-image relations: given each word, search related image regions to determine whether the word (with its context) fits the image
- the confidences of all relations should be weighted and then aggregated to generate the final sentence-image affinity
2.1. Visual Units
- Input. resize to 256x256
- Output. 512 visual units
- pre-trained on the dataset for person classification based on person IDs
- during joint training, only cls-fc1 and cls-fc2 are updated (see the sketch below)
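A minimal PyTorch-style sketch of this visual branch, assuming a VGG-16 backbone and hypothetical layer names/sizes (id_head, the 4096-d hidden width); the paper's exact architecture may differ:

```python
import torch
import torch.nn as nn
import torchvision

class VisualUnits(nn.Module):
    """Sketch of the visual branch: a CNN backbone followed by two FC layers
    (cls-fc1, cls-fc2) whose 512 outputs serve as the visual units."""
    def __init__(self, num_units=512, num_ids=11003, freeze_backbone=True):
        super().__init__()
        self.backbone = torchvision.models.vgg16(weights=None).features
        if freeze_backbone:
            # during joint training only cls-fc1 / cls-fc2 are updated
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.cls_fc1 = nn.Linear(512 * 7 * 7, 4096)    # hypothetical hidden size
        self.cls_fc2 = nn.Linear(4096, num_units)
        # classifier head used only for the person-ID pre-training stage
        self.id_head = nn.Linear(num_units, num_ids)

    def forward(self, images):                          # images: (B, 3, 256, 256)
        feat = self.pool(self.backbone(images)).flatten(1)
        units = torch.relu(self.cls_fc2(torch.relu(self.cls_fc1(feat))))
        return units                                    # (B, 512) visual unit activations
```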
2.2. Attention over Visual Units
- words are encoded into K-dimensional one-hot vectors, where K is the vocabulary size
- each one-hot vector is embedded and concatenated with the image features
- through LSTM → FCs → softmax, unit-level attention is generated at each word
- the per-word response is the attention-weighted summation over the visual units
- responses are summed over all T words (see the sketch after 2.2.1)
2.2.1. LSTM
- h. tanh activation for the hidden state
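A minimal sketch of this unit-level attention path, assuming hypothetical layer names and sizes (att_fc1/att_fc2, embedding and hidden widths); it follows the description above but is not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class UnitAttention(nn.Module):
    """Sketch: at each word, the LSTM state drives a softmax attention over the
    512 visual units; the per-word response is the attention-weighted sum of
    unit activations, summed over all T words (word-level gates come in 2.3)."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_units=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # one-hot -> dense embedding
        self.lstm = nn.LSTM(embed_dim + num_units, hidden_dim, batch_first=True)
        self.att_fc1 = nn.Linear(hidden_dim, 512)             # hypothetical FC names
        self.att_fc2 = nn.Linear(512, num_units)

    def forward(self, word_ids, visual_units):
        # word_ids: (B, T) token indices; visual_units: (B, 512) from the visual branch
        T = word_ids.size(1)
        words = self.embed(word_ids)                                  # (B, T, E)
        vis = visual_units.unsqueeze(1).expand(-1, T, -1)             # concat image features at every step
        h, _ = self.lstm(torch.cat([words, vis], dim=-1))             # (B, T, H)
        att = torch.softmax(self.att_fc2(torch.relu(self.att_fc1(h))), dim=-1)  # (B, T, 512)
        per_word = (att * visual_units.unsqueeze(1)).sum(-1)          # (B, T) attention-weighted responses
        return per_word.sum(-1)                                       # (B,) summed over T words, before gating
```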
2.3. Word-Level Gates for Visual Units
- different words carry significantly different amounts of information for obtaining language-image affinity ("white" should be more important than "this")
- unit-level attention cannot reflect such differences, since the softmax normalizes the attention to sum to 1 at every word
- a word-level scalar gate is therefore learned at each word (see the formula sketch below)
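Putting 2.2 and 2.3 together, the sentence-image affinity can be sketched as follows (notation is mine, not the paper's: v_n is the n-th visual unit activation, A_t(n) the unit-level attention at word t, g_t the word-level gate):

```latex
a_t = g_t \sum_{n=1}^{512} A_t(n)\, v_n, \qquad \mathrm{Affinity}(S, I) = \sum_{t=1}^{T} a_t
```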
2.4. Loss Function
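- affinities of matched and mismatched sentence-image pairs are supervised with a cross-entropy loss (negative pairs sampled as listed in 2.5; see the training sketch at the end of 2.5)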
2.5. Details
- SGD
- positive:negative=1:3
- batch size 128
- all FC layers have 512 units except gate-fc1
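A hedged sketch of one training step under these settings, assuming the model returns an affinity logit for a batch of sentence-image pairs (the hypothetical signature `model(images, sentences)` and the exact negative-sampling/loss details may differ from the paper):

```python
import torch
import torch.nn.functional as F

def training_step(model, images, sentences, optimizer, neg_ratio=3):
    """One SGD step: each matched sentence-image pair is joined by `neg_ratio`
    mismatched pairs (positive:negative = 1:3), and the affinity is trained
    with a binary cross-entropy loss."""
    pos = model(images, sentences)                     # (B,) affinity logits for matched pairs
    negs = []
    for _ in range(neg_ratio):
        perm = torch.randperm(images.size(0))          # shuffle images to build mismatched pairs
        negs.append(model(images[perm], sentences))    # may rarely hit the same ID; ignored here
    scores = torch.cat([pos] + negs)
    labels = torch.cat([torch.ones_like(pos)] + [torch.zeros_like(n) for n in negs])
    loss = F.binary_cross_entropy_with_logits(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage (hypothetical): optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# with batches of 128 sentence-image pairs, as noted above
```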
3. Experiments
3.1. Dataset
- training set. 11,003 persons; 34,054 images; 68,108 sentence descriptions
- testing set. 3,074 images of 1,000 persons
- validation set. 3,078 images of 1,000 persons
3.2. Comparison
- LSTM might have difficulty encoding complex sentences into a single feature vector
- word-by-word processing and comparison might be more suitable for the person search problem
- RNNs are more suitable for processing natural language data
3.3. Ablation Study
- the initial training stage strongly affects the final performance
3.4. The Number of Visual Unit
- more units might over-fit the dataset